00100		MULTIDIMENSIONAL ANALYSIS IN EVALUATING  A SIMULATION
00200		       OF PARANOID THOUGHT PROCESSES
00300	
00400	               KENNETH MARK COLBY
00500	
00600	
00700	
00800		Once a simulation model reaches a stage of intuitive adequacy
00900	based  on  face validity, the model builder then considers using more
01000	stringent evaluation procedures depending on the purposes  the  model
01100	is  intended  to  serve.  If  the  model  is  to  serve  a  practical
01200	application, for example as a training  device,  then  a  rough  and
01300	ready  approximation  may be sufficient.  But when the model is being
01400	proposed as a theoretical explanation of a psychological process, more
01500	is demanded of the representation than face validity.
01600		A computer  simulation  model  consists  of  a  structure  of
01700	hypothetical  mechanisms  or  procedures  sufficient  to generate the
01800	input-output behavior under consideration. The theory embodied in the
01900	model  can  be  made  clear  by  statements  which  describe  how the
02000	postulated structure reacts under various circumstances. I shall  not
02100	describe  a  theory or model of paranoid processes here, but rather I
02200	shall  concentrate  on  the  evaluation  problem   which   asks   the
02300	disarmingly  simple  question `how good is the model?' While the term
02400	`good' has many senses  in  ordinary  language,  I  shall  take  this
02500	question  to  mean  `how  close  is  the  correspondence  between the
02600	behavior of the model and the phenomena it is intended to explain?'
02700	Turing's  Test  has  often been suggested as an aid in answering this
02800	question for computer models  but  as  far  as  I  know  no  one  has
02900	conducted a true version of this test.
03000		It is very easy to become confused about  Turing's  Test.  In
03100	part  this  is  due  to  Turing himself who introduced the now-famous
03200	imitation game in a  1950  paper  entitled  COMPUTING  MACHINERY  AND
03300	INTELLIGENCE [3].  A careful reading of this paper reveals there are
03400	actually two games proposed, the second of which is commonly  called
03500	Turing's test.
03600		In the first imitation game  two  groups  of  judges  try  to
03700	determine which of two interviewees is a woman. Communication between
03800	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
03900	informed  that  one  of the interviewees is a woman and one a man who
04000	will pretend to be a woman. After the interview, the judge  is  asked
04100	what  we shall call the woman-question, i.e. which interviewee was the
04200	woman?  Turing does not say what else  the  judge  is  told  but  one
04300	assumes  the  judge is NOT told that a computer is involved nor is he
04400	asked to determine which  interviewee  is  human  and  which  is  the
04500	computer.  Thus,  the  first  group  of  judges  would  interview two
04600	interviewees:    a woman, and a man pretending to be a woman.
04700		The  second  group  of judges would be given the same initial
04800	instructions, but unbeknownst to them, the two interviewees would  be
04900	a  woman  and a computer programmed to imitate a woman.   Both groups
05000	of judges  play  this  game  until  sufficient  statistical  data are
05100	collected  to  show  how  often the right identification is made. The
05200	crucial question then is:  do the judges decide wrongly AS OFTEN when
05300	the  game  is  played  with man and woman as when it is played with a
05400	computer substituted for the man?  If so, then  the  program  is
05500	considered  to  have  succeeded in imitating a woman as well as a man
05600	imitating  a  woman.    For emphasis we repeat: in  asking  the
05700	woman-question  in  this  game,  judges  are not required to identify
05800	which interviewee is human and which is machine.
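	The comparison this game calls for is between the misidentification
rates of the two judge groups.  A minimal sketch of that comparison,
using hypothetical counts and a simple two-proportion z-test (one
plausible choice; Turing prescribes no particular statistic), is:

    # Sketch only: do the two groups of judges err at detectably
    # different rates?  The counts below are hypothetical placeholders.
    from math import sqrt

    def error_rates_differ(wrong_a, n_a, wrong_b, n_b, z_crit=1.96):
        """Two-proportion z-test at the 5% level (two-sided)."""
        p_a, p_b = wrong_a / n_a, wrong_b / n_b
        p_pool = (wrong_a + wrong_b) / (n_a + n_b)      # pooled error rate
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        return abs(p_a - p_b) / se > z_crit             # True = rates differ

    # e.g. 18 of 40 judges wrong with the man, 21 of 40 with the computer
    print(error_rates_differ(18, 40, 21, 40))           # False: rates comparable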
05900		Later  on  in  his  paper  Turing proposes a variation of the
06000	first game. In the second game one interviewee is a man and one is  a
06100	computer.   The judge is asked to determine which is man and which is
06200	machine, which we shall call the machine-question. It is this version
06300	of  the game which is commonly thought of as Turing's test.    It has
06400	often been suggested as a means of validating computer simulations of
06500	psychological processes.
06600		In the course of testing a  simulation  (PARRY)  of  paranoid
06700	linguistic behavior in a psychiatric interview, we conducted a number
06800	of   Turing-like   indistinguishability   tests   [1].      We    say
06900	`Turing-like' because none of them consisted of playing the two games
07000	described above. We chose not to play these games  for  a  number  of
07100	reasons  which  can  be  summarized  by  saying that they do not meet
07200	modern criteria for good experimental design.  In designing our tests
07300	we  were  primarily  interested in learning more about developing the
07400	model.  We did not believe the simple machine-question to be a useful
07500	one   in   serving   the  purpose  of  progressively  increasing  the
07600	credibility of the model but we investigated a  variation  of  it  to
07700	satisfy the curiosity of colleagues in artificial intelligence.
07800		In this design eight psychiatrists  interviewed  by  teletype
07900	two  patients,  one  being PARRY and one being an actual
08000	hospitalized paranoid patient.   The interviewers were  not  informed
08100	that  a simulation was involved nor were they asked to identify which
08200	was the machine. Their task was to conduct a  diagnostic  psychiatric
08300	interview  and  rate  each  response  from the `patients' along a 0-9
08400	scale of paranoidness, 0 meaning none and 9 the highest degree.
08500	Transcripts   of   these  interviews,  without  the  ratings  of  the
08600	interviewers, were then utilized for  various  experiments  in  which
08700	randomly   selected   expert  judges  conducted  evaluations  of  the
08800	interview transcripts.   For example, in one experiment it was  found
08900	that patients and model were indistinguishable along the dimension of
09000	paranoidness.
09100		To ask the machine-question, we sent  interview  transcripts,
09200	one  with a patient and one with PARRY, to 100 psychiatrists randomly
09300	selected from the Directory of American Specialists and the Directory
09400	of  the  American Psychiatric Association. Of the 41 replies 21 (51%)
09500	made the correct identification while 20 (49%) were wrong.  Based  on
09600	this  random  sample of 41 psychiatrists, the 95% confidence interval
09700	is between 35.9 and 66.5, a range which includes the chance level of 50%.  (Our
09800	statistical   consultant   was  Dr.   Helena  C.   Kraemer,  Research
09900	Associate  in  Biostatistics,  Department  of  Psychiatry,   Stanford
10000	University.)
10100		Psychiatrists are expert judges of patient interview behavior
10200	but  they  are unfamiliar with computers. Hence we conducted the same
10300	test  with  100  computer  scientists  randomly  selected  from   the
10400	membership list of the Association for Computing Machinery, ACM.   Of
10500	the 67 replies 32 (48%) were right and 35 (52%) were wrong. Based  on
10600	this  random  sample  of  67 computer scientists, the 95% confidence interval
10700	ranges from 36 to 60, again spanning the chance level.
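	The intervals quoted above are consistent with the ordinary normal
approximation to a binomial proportion; the short sketch below
reproduces that arithmetic (the original analysis may have differed in
detail):

    # Check of the quoted figures: 95% normal-approximation confidence
    # interval for a proportion of correct identifications, in percent.
    from math import sqrt

    def ci_95(correct, n):
        p = correct / n
        half_width = 1.96 * sqrt(p * (1 - p) / n)
        return 100 * (p - half_width), 100 * (p + half_width)

    print(ci_95(21, 41))   # psychiatrists:       about (35.9, 66.5)
    print(ci_95(32, 67))   # computer scientists: about (35.8, 59.7)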
10800		Thus the answer to this machine-question "can expert  judges,
10900	psychiatrists  and  computer scientists, using teletyped transcripts
11000	of psychiatric interviews, distinguish between paranoid patients  and
11100	a  simulation  of paranoid processes?" is "No". But what do we learn
11200	from this?   It is some comfort that the answer was not "yes" and the
11300	null  hypothesis  (no  differences) failed to be rejected, especially
11400	since statistical tests are somewhat biased in favor of rejecting the
11500	null  hypothesis [2].  Yet this answer does not tell us what we would
11600	most like to know, i.e. how to improve the model.  Simulation  models
11700	do  not spring forth in a complete, perfect and final form; they must
11800	be gradually developed over time. Perhaps  we  might  obtain  a  "yes"
11900	answer to the machine-question if we allowed a large number of expert
12000	judges to conduct the  interviews  themselves  rather  than  studying
12100	transcripts  of  other  interviewers.  Such an answer would indicate that the
12200	model must be improved, but unless we systematically investigated  how
12300	the  judges  succeeded in making the discrimination we would not know
12400	what aspects of the model to work on. The logistics of such a  design
12500	are  immense  and obtaining a large N of judges for sound statistical
12600	inference  would  require   an   effort   disproportionate   to   the
12700	information-yield.
12800		A more efficient and informative way to use Turing-like tests
12900	is to ask judges to make ordinal ratings along scaled dimensions from
13000	teletyped  interviews.     We  shall  term  this  approach asking the
13100	dimension-question.   One can then compare scaled ratings received by
13200	the patients and by the model to precisely determine where and by how
13300	much they differ.        Model builders  strive  for  a  model  which
13400	shows     indistinguishability     along    some    dimensions    and
13500	distinguishability along others.  That is, the model converges on what
13600	it is supposed to simulate and diverges from that which it is not.
13700		We  mailed  paired-interview  transcripts  to   another   400
13800	randomly  selected psychiatrists asking them to rate the responses of
13900	the two `patients' along certain dimensions. The judges were  divided
14000	into  groups,  each  judge  being asked to rate responses of each I-O
14100	pair in the interviews along four dimensions.  The  total  number  of
14200	dimensions  in  this  test  was twelve: linguistic noncomprehension,
14300	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
14400	ideas  of  reference, delusions, mistrust, depression, suspiciousness
14500	and mania. These are dimensions which psychiatrists commonly  use  in
14600	evaluating patients.
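	The kind of per-dimension comparison the dimension-question calls
for can be sketched as follows.  The test chosen here (Mann-Whitney U)
and the ratings are illustrative assumptions, not the analysis
actually performed for Table 1:

    # Illustrative sketch only: compare the ordinal ratings given to
    # patient responses and to model responses on a single dimension.
    from scipy.stats import mannwhitneyu

    def compare_dimension(patient_ratings, model_ratings, alpha=0.05):
        """Return which source was rated higher and whether the
        difference is statistically significant at level alpha."""
        _, p_value = mannwhitneyu(patient_ratings, model_ratings,
                                  alternative='two-sided')
        mean_patient = sum(patient_ratings) / len(patient_ratings)
        mean_model = sum(model_ratings) / len(model_ratings)
        higher = 'model' if mean_model > mean_patient else 'patient'
        return higher, p_value < alpha

    # hypothetical 0-9 ratings on, say, linguistic noncomprehension
    print(compare_dimension([1, 0, 2, 1, 1, 0], [4, 3, 5, 2, 4, 3]))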
14700		Table 1 shows there were significant differences, with  PARRY
14800	receiving   higher   scores   along   the  dimensions  of  linguistic
14900	noncomprehension, thought disorder, bizarreness, anger,  mistrust  and
15000	suspiciousness. On the dimension of delusions the patients were rated
15100	significantly higher. There were no significant differences along the
15200	dimensions  of  organic  brain  syndrome, fear,  ideas  of  reference,
15300	depression and mania.
15400		While    tests    asking    the   machine-question   indicate
15500	indistinguishability at  the  gross  level,  a  study  of  the  finer
15600	structure  of  the  model's  behavior  through  ratings  along scaled
15700	dimensions  shows  statistically  significant   differences   between
15800	patients  and  model.     These  differences are of help to the model
15900	builder in suggesting which aspects of the model must be modified and
16000	improved  in  order  to  be  considered an adequate simulation of the
16100	class of paranoid patients it is intended to simulate.  For  example,
16200	it  is  clear  that  PARRY'S language-comprehension must be improved.
16300	Once this has been implemented, a future test will  tell  us  whether
16400	improvement has occurred and by how much in comparison to the earlier
16500	version.   Successive identification of particular areas  of  failure
16600	in the model permits their improvement and the development of
16700	more adequate model-versions.
16800		Further evidence that the machine-question is too coarse  and
16900	insensitive  a test comes from the following experiment. In this test
17000	we constructed a random version of the paranoid model which  utilized
17100	PARRY'S  output statements but expressed them randomly no matter what
17200	the interviewer said.   Two psychiatrists conducted  interviews  with
17300	this  model, transcripts of which were paired with patient interviews
17400	and sent to 200  randomly  selected  psychiatrists  asking  both  the
17500	machine-question  and  the dimension-question.  Of the 69 replies, 34
17600	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
17700	69  psychiatrists,  the 95% confidence interval ranges from 39 to 63,
17800	again indicating  a  chance  level.  However  as  shown  in  Table  2
17900	significant  differences  appear  along  the dimensions of linguistic
18000	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
18100	rated  higher.  On  these  particular  dimensions  we can construct a
18200	continuum in which the random version  represents  one  extreme,  the
18300	actual patients another. Our (nonrandom) PARRY lies somewhere between
18400	these two extremes, indicating that it performs significantly  better
18500	than  the  random version but still requires improvement before being
18600	indistinguishable from patients (see Fig. 1).  Hence  this  approach
18700	provides  yardsticks  for measuring the adequacy of this or any other
18800	dialogue simulation model along the relevant dimensions.
18810	(Insert comparison of dimensions between PARRY and RANDOM-PARRY)
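	The random control can be pictured as a degenerate dialogue model:
it draws from PARRY's pool of output statements while ignoring the
interviewer's input entirely.  A minimal sketch of that idea, with
invented placeholder responses rather than PARRY's actual statements,
is:

    # Minimal sketch of a RANDOM-PARRY-style control.  The response
    # strings are invented placeholders, not PARRY's actual output.
    import random

    CANNED_RESPONSES = [
        "Why do you want to know that?",
        "I don't trust people who ask questions.",
        "They have been following me for weeks.",
    ]

    def random_parry(interviewer_input: str) -> str:
        # the input is deliberately ignored; that is the point of the control
        return random.choice(CANNED_RESPONSES)

    print(random_parry("How are you feeling today?"))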
18900		We  conclude  that  when model builders want to conduct tests
19000	which indicate in which direction  progress  lies  and  to  obtain  a
19100	measure  of  whether  progress  is  being  achieved,  the  way to use
19200	Turing-like tests is to ask  expert  judges  to  make  ratings  along
19300	multiple  dimensions considered essential to the model.  Useful tests
19400	do not prove a model; they probe it  for  its  sensitivities.  Simply
19500	asking   the  machine-question  yields  no  information  relevant  to
19600	improving what the model builder knows is only a first approximation.
19700	
19800	
19900			REFERENCES
20000	
20100	[1] Colby, K.M., Hilf, F.D., Weber, S., and Kraemer, H.C.  Turing-like
20200	indistinguishability tests for the validation of a computer
20300	simulation of paranoid processes.  ARTIFICIAL INTELLIGENCE, 3
20400	(1972), 199-221.
20500	
20600	[2] Meehl, P.E.  Theory testing in psychology and physics: a
20700	methodological paradox.  PHILOSOPHY OF SCIENCE, 34 (1967), 103-115.
20800	
20900	[3] Turing, A.  Computing machinery and intelligence.  Reprinted in:
21000	COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
21100	McGraw-Hill, New York, 1963, pp. 11-35.